ppoi improvement plan
2018-10-24
Suggestions for improving ppoi. I'm noting them here because starting to fix them while I'm using it would be distracting.
Being told to "put unknown.txt in place" makes no sense to me.
I can't tell what exactly is going to happen.
The data: OCR output for 5 books, i.e. the OCR'd scanned PDFs of 5 cut-up books, concatenated with cat.
$ wc unknown.txt
35203 110327 2069652 unknown.txt
code:python
>>> lines = open("unknown.txt").readlines()
>>> len(lines)
25762
>>> open("unknown.txt", "w").writelines(lines)
>>> from collections import Counter
>>> c = Counter()
>>> # count every character across all lines
>>> for line in lines:
...     c.update(line)
>>> c.most_common(100)
...
>>> len(c)
2851
>>> # number of distinct characters that occur more than 9 times
>>> len([1 for k in c if c[k] > 9])
1513
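The last count (characters that occur more than 9 times) suggests one way to get a character list for features. A minimal sketch, assuming that a file like `chars.txt` is simply those frequent characters written out; these notes do not say how it was actually produced, and `build_vocab` and `min_count` are hypothetical names. Unlike the session above, it strips the trailing newline so that "\n" does not become a feature.

```python
from collections import Counter


def build_vocab(lines, min_count=10):
    """Return the characters occurring at least min_count times, most frequent first."""
    counts = Counter()
    for line in lines:
        # drop the trailing newline so "\n" is not counted as a character
        counts.update(line.rstrip("\n"))
    return "".join(ch for ch, n in counts.most_common() if n >= min_count)


# tiny in-memory demo instead of the real unknown.txt
demo_lines = ["aaa\n", "ab\n", "b\n"]
vocab = build_vocab(demo_lines, min_count=2)
```

With the real data this would be `build_vocab(open("unknown.txt").readlines())`, and the result could then be written to a file with `open(...).write(vocab)`.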
(Future feature) The instructions say to provide at least one positive and one negative example at initialization, but that shouldn't be necessary: if data hasn't been entered for both sides, just treat every score as 0.5 and proceed to active learning.
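A rough sketch of that fallback, not ppoi's actual code. `predict_scores`, `pick_most_uncertain` and the `model_predict` callable are hypothetical names. I'm also assuming the fallback applies when only one side has data, since a two-class model can't be trained from one class.

```python
def predict_scores(unknown, positive, negative, model_predict=None):
    """Return one score in [0, 1] per unknown item.

    model_predict is a hypothetical callable (text -> score) standing in
    for whatever classifier is trained on the labeled examples.
    """
    # Assumption: without examples on both sides there is nothing to train on,
    # so every item is treated as maximally uncertain.
    if not positive or not negative or model_predict is None:
        return [0.5] * len(unknown)
    return [model_predict(text) for text in unknown]


def pick_most_uncertain(scores):
    """Index of the score closest to 0.5, i.e. the next item to ask about."""
    return min(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))


unknown = ["|11111'", "a normal sentence", "x"]
scores = predict_scores(unknown, positive=[], negative=[])
first_question = unknown[pick_most_uncertain(scores)]
```

With no labels every item ties at 0.5, so the first question is just the first line; after a few answers the real scores take over.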
(Consideration) I assumed it is meant to be run from the command line, but couldn't it be run from IPython as well?
Can IPython take input?
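As far as I know the built-in `input()` works inside IPython (and in Jupyter notebooks), so the question loop doesn't have to be tied to the command line. A small sketch; `ask_label` is a hypothetical helper, not part of ppoi, and the `read` parameter exists only so that the function can be tried without typing.

```python
def ask_label(text, read=input):
    """Show text and keep asking until the answer is y or n; True means positive."""
    while True:
        answer = read(f"{text!r} positive? [y/n] ").strip().lower()
        if answer in ("y", "n"):
            return answer == "y"


# Non-interactive demo: feed canned answers instead of typing.
canned = iter(["maybe", "Y"])
result = ask_label("|11111'", read=lambda prompt: next(canned))
```

In an IPython session, `ask_label("some line")` with the default `read=input` would prompt in the usual way.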
I don't think users will understand being told to put ppoi/unknown.txt in place.
This time the task is judging OCR garbage, so:
code::
$ cat ppoi/positive.txt
|11111'「 11
$ cat ppoi/negative.txt
Without good interviews, no matter how deftly you put them together, you will never know how to reach a good decision.
Change feature creation in user.py
code:python
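import numpy as np
CHARS = open("chars.txt").read()
def make_features(s):
    "take a string, return np.array"
    # assumption: x holds the count of each character of CHARS in s
    x = np.array([s.count(c) for c in CHARS], dtype=float)
    # normalize (skip when no character of CHARS occurs, to avoid 0/0)
    if x.sum() > 0:
        x = x / x.sum()
    return x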
It would be nice to have undo for when I enter a wrong answer during active learning.
Right now I have to open and edit two files.
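One possible shape for undo, as a sketch only: keep the answers in memory as a history and write the two files from it, so that undo is a single pop instead of hand-editing. `LabelLog` is a hypothetical class, and I'm assuming positive.txt and negative.txt hold one example per line.

```python
class LabelLog:
    """In-memory record of answers, written out only when save() is called."""

    def __init__(self):
        self.history = []  # list of (text, is_positive), oldest first

    def add(self, text, is_positive):
        self.history.append((text, is_positive))

    def undo(self):
        """Drop the most recent answer and return it (None if there is none)."""
        return self.history.pop() if self.history else None

    def split(self):
        """Return (positive_texts, negative_texts) in the order they were answered."""
        positive = [t for t, is_pos in self.history if is_pos]
        negative = [t for t, is_pos in self.history if not is_pos]
        return positive, negative

    def save(self, positive_path="ppoi/positive.txt", negative_path="ppoi/negative.txt"):
        # Assumption: one example per line in each file.
        positive, negative = self.split()
        for path, items in ((positive_path, positive), (negative_path, negative)):
            with open(path, "w", encoding="utf-8") as f:
                f.writelines(item + "\n" for item in items)


log = LabelLog()
log.add("|11111'", True)
log.add("A perfectly readable sentence.", True)   # wrong answer
undone = log.undo()                               # take it back
log.add("A perfectly readable sentence.", False)  # answer again, correctly
```

Calling `log.save()` afterwards would rewrite both files from the corrected history.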
During active learning, you end up wanting more categories than the initial two.
In this case, besides Yes/No, there's also "This is too short to make a decision."
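A sketch of how a third answer could be recorded without changing the two-class training data. The key letters, the category names and both helper functions are hypothetical. Items marked as too short are kept, but left out of training.

```python
# Hypothetical mapping from answer key to category; adding a category is one more entry.
ANSWERS = {"y": "positive", "n": "negative", "s": "too_short"}


def record(labels, text, key):
    """Store text under the category selected by key and return that category."""
    category = ANSWERS[key]
    labels.setdefault(category, []).append(text)
    return category


def training_pairs(labels):
    """Only positive/negative items become (text, target) pairs for training."""
    pairs = [(t, 1) for t in labels.get("positive", [])]
    pairs += [(t, 0) for t in labels.get("negative", [])]
    return pairs


labels = {}
record(labels, "|11111'", "y")
record(labels, "A perfectly readable sentence.", "n")
record(labels, "11", "s")  # too short to decide
```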
---
This page is auto-translated from /nishio/ppoi改善案. If something looks interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.